Alpha Excel Benchmark

Noever, David, McKee, Forrest

arXiv.org Artificial Intelligence

ABSTRACT This study presents a novel benchmark for evaluating Large Language Models (LLMs) using challenges derived from the Financial Modeling World Cup (FMWC) Excel competitions. We introduce a methodology for converting 113 existing FMWC challenges into programmatically evaluable JSON formats and use this dataset to compare the performance of several leading LLMs. Our findings demonstrate significant variations in performance across different challenge categories, with models showing specific strengths in pattern recognition tasks but struggling with complex numerical reasoning. The benchmark provides a standardized framework for assessing LLM capabilities in realistic business-oriented tasks rather than abstract academic problems. This research contributes to the growing field of AI benchmarking by establishing proficiency in Microsoft Excel, a tool used daily by an estimated 1.5 billion people, as a meaningful evaluation metric that bridges the gap between academic AI benchmarks and practical business applications.

INTRODUCTION The recent rapid advancement of Large Language Models (LLMs) has sparked interest in developing specialized benchmarks to evaluate their capabilities across various domains. While existing benchmarks often focus on natural language understanding, programming, or reasoning abilities in abstract contexts, there remains a notable gap in benchmarks that assess performance on practical business tasks (Brown et al., 2020). Microsoft Excel, being one of the most widely used business software tools globally, presents an opportunity to create tasks that simultaneously test multiple dimensions of LLM capabilities, including numerical reasoning, pattern recognition, rule comprehension, file conversion, and problem-solving strategies. The Financial Modeling World Cup (FMWC), established in 2020, has emerged as a premier global competition testing advanced Excel skills through creative challenges that range from financial modeling to game simulations implemented in spreadsheets (Grigolyunovich, 2022).
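The abstract describes the JSON conversion only at a high level. As a rough illustration, below is a minimal Python sketch of what a programmatically evaluable challenge record and scorer might look like; all field names (id, category, expected_answer, tolerance) are assumptions made for illustration, not the authors' actual schema.

    # Hypothetical sketch of a programmatically evaluable FMWC-style challenge.
    # Field names are illustrative assumptions, not the authors' actual schema.
    import json

    challenge_json = """
    {
      "id": "fmwc-example-001",
      "category": "financial-modeling",
      "prompt": "Given the monthly cash flows below, compute the ending balance.",
      "inputs": {"cash_flows": [1200, -450, 300]},
      "expected_answer": 1050,
      "tolerance": 1e-6
    }
    """

    def evaluate(record: dict, model_answer: float) -> bool:
        """Score a model's numeric answer against the stored ground truth."""
        return abs(model_answer - record["expected_answer"]) <= record["tolerance"]

    record = json.loads(challenge_json)
    print(evaluate(record, 1050.0))  # True: answer matches within tolerance

A format like this lets the benchmark compare model outputs against ground truth automatically, rather than by manual inspection of spreadsheets.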


Last Week in AI #177: OpenAI commercializes DALL-E 2, Sony AI beats human competitors in racing game, Gmail getting smarter searches, and more!

#artificialintelligence

Last week OpenAI moved DALL-E 2, the image generation tool, into Beta (the company hopes to expand its current user base to 1 million) while granting users "the right to reprint, sell, and merchandise" images they generate with DALL-E. This is useful for users who wish to use the generated images for commercial purposes, like making illustrations for children's books. Also, it's not clear whether OpenAI violated any IP laws by training on these Internet images and then commercializing the model; other openly available AI image generation models face similar problems. While the UK is exploring allowing commercial use of models trained on public but trademarked data, the U.S. may not follow suit.


Sony's racing AI destroyed its human competitors by being nice (and fast)

MIT Technology Review

But Sony soon learned that speed alone wasn't enough to make GT Sophy a winner. The program outpaced all human drivers on an empty track, setting superhuman lap times across three different virtual courses. Yet when Sony tested GT Sophy in a race against multiple human drivers, where intelligence as well as speed is needed, GT Sophy lost. The program was at times too aggressive, racking up penalties for reckless driving, and at other times too timid, giving way when it didn't need to. Sony regrouped, retrained its AI, and set up a rematch in October.


'WebCrow 2.0' AI can solve crosswords in two languages

Engadget

Crossword puzzles aren't always easy to solve even for the most avid human fans, and they also remain one of the most challenging areas in artificial intelligence. Now, the University of Siena in Italy and expert.ai have developed WebCrow 2.0, which uses natural language processing technology to understand a puzzle's clues like a human player would. That's trickier than it sounds, seeing as the same word could mean totally different things based on context, and crossword puzzle clues often contain a play on words. The answer to the clue "liquid that does not stick," for instance, is "scotch," which alludes to Scotch tape. Plus, the AI draws on previously solved puzzles and its self-updating web knowledge to find the correct answer.


IMO Grand Challenge

#artificialintelligence

The International Mathematical Olympiad (IMO) is perhaps the most celebrated mental competition in the world and as such is among the ultimate grand challenges for Artificial Intelligence (AI). The challenge: build an AI that can win a gold medal in the competition. To remove ambiguity about the scoring rules, we propose the formal-to-formal (F2F) variant of the IMO: the AI receives a formal representation of the problem (in the Lean Theorem Prover) and is required to emit a formal (i.e., machine-checkable) proof. We are working on a proposal for encoding IMO problems in Lean and will seek broad consensus on the protocol. Each proof certificate that the AI produces must be checkable by the Lean kernel in 10 minutes (which is approximately the amount of time it takes a human judge to judge a human's solution).
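For a sense of what the F2F setup entails, here is a toy sketch in Lean 4: a trivially simple statement (not an actual IMO problem) together with a proof that the Lean kernel can verify. The real encoding protocol for IMO problems is, as the post notes, still being proposed.

    -- Toy illustration of the formal-to-formal (F2F) setup, not an actual
    -- IMO problem: the AI would receive a formal statement like the one
    -- below and must emit a proof checkable by the Lean kernel.
    theorem toy_commutativity (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

An actual IMO problem would be far harder to state and prove, but the kernel-checking step works the same way: the proof term either type-checks or it doesn't, leaving no room for judging disputes.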


AI is the Future Cyber Weapon of Internet Criminals

#artificialintelligence

Just last month, Tesla CEO and tech billionaire Elon Musk aired his sentiments about the threats that AI poses to humans. It's pretty alarming, but some experts assured the public that 'AI Doomsday' is still far from happening. However, that doesn't mean that we are safe from online hackers who might use AI as their future cyber weapon. Just like the other technological advancements we have right now, artificial intelligence can be exploited by criminals to carry out their evil work. Almost every week, news of hacking incidents comes from all corners of the world.


Has AI passed a new milestone? It's beaten human players at poker, say researchers

ZDNet

Researchers behind a poker-playing AI system called DeepStack say it's the first algorithm to have ever beaten poker pros in heads-up no-limit Texas hold'em. The claim, if verified, would mark a major milestone in the development of artificial-intelligence systems. Beating expert poker players differs from past AI successes against human competitors in games such as Jeopardy and Go because each player's hand provides only an incomplete picture of the state of play, requiring a program to navigate tactics, such as bluffing, based on asymmetrical information. DeepStack is the work of a collaboration between researchers at the University of Alberta and two Czech universities, who say in a new non-peer-reviewed paper that it's the "first computer program to beat professional poker players in heads-up no-limit Texas hold'em".